Preventing "Overfitting" of Cross-Validation Data

Author

  • Andrew Y. Ng
Abstract

Suppose that, for a learning task, we have to select one hypothesis out of a set of hypotheses (that may, for example, have been generated by multiple applications of a randomized learning algorithm). A common approach is to evaluate each hypothesis in the set on some previously unseen cross-validation data, and then to select the hypothesis that had the lowest cross-validation error. But when the cross-validation data is partially corrupted, such as by noise, and the set of hypotheses we are selecting from is large, then "folklore" also warns about "overfitting" the cross-validation data. In this paper, we explain how this "overfitting" really occurs, and show the surprising result that it can be overcome by selecting a hypothesis with a higher cross-validation error, over others with lower cross-validation errors. We give reasons for not selecting the hypothesis with the lowest cross-validation error, and propose a new algorithm, LOOCVCV, that uses a computationally efficient form of leave-one-out cross-validation to select such a hypothesis. Finally, we present experimental results for one domain that show LOOCVCV consistently beating picking the hypothesis with the lowest cross-validation error, even when using reasonably large cross-validation sets.
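To make the "overfitting" the abstract describes concrete, here is a small simulation sketch (not the paper's LOOCVCV algorithm; the number of hypotheses, the size of the cross-validation set, and the error distribution are arbitrary assumptions). It picks, from many candidate hypotheses, the one with the lowest error on a finite, noisy cross-validation set, and compares the error that set reports with the hypothesis's true error and with the truly best hypothesis available.

```python
import numpy as np

# Minimal simulation of selecting the lowest cross-validation error from a
# large hypothesis set (illustrative assumptions only, not the paper's setup).
rng = np.random.default_rng(0)

n_hypotheses = 200   # size of the hypothesis set (assumed value)
n_cv = 100           # size of the cross-validation set (assumed value)
n_trials = 2000

cv_of_pick, true_of_pick, true_of_best = [], [], []
for _ in range(n_trials):
    # Hypothetical true error rates of the candidate hypotheses.
    true_err = rng.uniform(0.20, 0.40, size=n_hypotheses)
    # CV error = fraction of the n_cv held-out examples misclassified;
    # binomial sampling plays the role of the noise in the CV data.
    cv_err = rng.binomial(n_cv, true_err) / n_cv

    pick = np.argmin(cv_err)              # the usual rule: lowest CV error
    cv_of_pick.append(cv_err[pick])       # what the CV set reports for it
    true_of_pick.append(true_err[pick])   # what we actually get
    true_of_best.append(true_err.min())   # the genuinely best hypothesis

print(f"mean CV error of the selected hypothesis  : {np.mean(cv_of_pick):.3f}")
print(f"mean true error of the selected hypothesis: {np.mean(true_of_pick):.3f}")
print(f"mean true error of the truly best one     : {np.mean(true_of_best):.3f}")
```

Typical output shows the selected hypothesis's cross-validation error well below its true error, and its true error above that of the genuinely best candidate: the minimum over many noisy estimates is optimistically biased, which is the effect that motivates not simply picking the hypothesis with the lowest cross-validation error.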


Similar articles

On Overfitting Avoidance as Bias

In supervised learning it is commonly believed that penalizing complex functions helps one avoid "overfitting" functions to data, and therefore improves generalization. It is also commonly believed that cross-validation is an effective way to choose amongst algorithms for fitting functions to data. In a recent paper, Schaffer (1993) presents experimental evidence disputing these claims. The cur...


On the Dangers of Cross-Validation. An Experimental Evaluation

Cross validation allows models to be tested using the full training set by means of repeated resampling; thus maximizing the total number of points used for testing and, potentially, helping to protect against overfitting. Improvements in computational power, recent reductions in the (computational) cost of classification algorithms, and the development of closed-form solutions (for performing ...

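As a point of reference for the repeated-resampling scheme this abstract describes, here is a minimal k-fold cross-validation loop; the closed-form ridge model and the synthetic data are placeholder assumptions, not taken from the paper.

```python
import numpy as np

# Generic k-fold cross-validation (illustrative only; the model and data are
# placeholders, not from the paper under discussion).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=120)

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_mse(X, y, k=5, lam=1.0):
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))
    # Every point is held out exactly once, so all n points contribute to the
    # test estimate -- the "repeated resampling" the abstract refers to.
    return float(np.mean(errs))

print(f"5-fold CV mean squared error: {kfold_mse(X, y):.4f}")
```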

The Probability of Backtest Overfitting

Many investment firms and portfolio managers rely on backtests (i.e., simulations of performance based on historical market data) to select investment strategies and allocate capital. Standard statistical techniques designed to prevent regression overfitting, such as holdout, tend to be unreliable and inaccurate in the context of investment backtests. We propose a general framework to assess th...


Cross-validation in cryo-EM-based structural modeling.

Single-particle cryo-EM is a powerful approach to determine the structure of large macromolecules and assemblies thereof in many cases at subnanometer resolution. It has become popular to refine or flexibly fit atomic models into density maps derived from cryo-EM experiments. These density maps are typically significantly lower in resolution than electron density maps obtained from X-ray diffra...


A Permutation Approach to Validation

We give a permutation approach to validation (estimation of out-sample error). One typical use of validation is model selection. We establish the legitimacy of the proposed permutation complexity by proving a uniform bound on the out-sample error, similar to a VC-style bound. We extensively demonstrate this approach experimentally on synthetic data, standard data sets from the UCI-repository, a...

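The excerpt does not spell out the paper's permutation complexity, so the sketch below only illustrates the general flavor of permutation-based validation under stated assumptions: it refits a placeholder model to label-permuted copies of the data and treats the model's ability to fit pure noise as a penalty on the in-sample error.

```python
import numpy as np

# Loose illustration of a permutation-style validation estimate (NOT the
# paper's exact "permutation complexity"; model, penalty form, and data are
# placeholders chosen only to make the idea concrete).
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=80) > 0).astype(float)

def train_error(X, y, lam=1.0):
    # Regularized least squares used as a simple classifier (placeholder model).
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return float(np.mean((X @ w > 0.5).astype(float) != y))

in_sample = train_error(X, y)

# Error the same procedure achieves on label-permuted copies of the data: any
# improvement over the best constant predictor reflects capacity to fit noise.
baseline = min(y.mean(), 1 - y.mean())
perm = np.mean([train_error(X, rng.permutation(y)) for _ in range(50)])
penalty = max(0.0, baseline - perm)

print(f"in-sample error          : {in_sample:.3f}")
print(f"permutation-based penalty: {penalty:.3f}")
print(f"penalized error estimate : {in_sample + penalty:.3f}")
```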


Journal:

Volume:   Issue:

Pages: –

Publication date: 1997